Chat response parsing #40894
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @zucchini-nlp do you know what the popular VLMs using reasoning or tool calls are right now? I'd like to add support + testing in the processor too.
From the ones we already have in the library, the Ovis2 chat template has some parts for tool usage. Other than that, I haven't seen explicit reasoning/tools in templates.
Thank you! I'll take a look at Ovis2, and if not, I can worry about adding it later.
| "properties": { | ||
| "role": {"const": "assistant"}, | ||
| "content": {"type": "string"}, | ||
| "thinking": {"type": "string"} |
Hmm, I think we often use `thinking` because that's the key that chat templates use in their input! The idea here is that the returned dict should be ready to append to the chat history without further user intervention.
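As a purely illustrative sketch (not output from this PR), the intent is that the parsed dict can be appended to the message list as-is:

```python
# Hypothetical parsed response using the "thinking" key discussed here;
# the values are made up, only the shape matters.
parsed = {
    "role": "assistant",
    "thinking": "The user wants 6 * 8, which is 48.",
    "content": "6 * 8 = 48.",
}

messages = [{"role": "user", "content": "Hey, what's 6 * 8?"}]
messages.append(parsed)  # ready for the next apply_chat_template call
```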
I see, that makes sense - is that just for gpt-oss or have you seen other models adopt `thinking` too in their chat templates?
To clarify, my question was about whether returning `reasoning_content` would provide a drop-in replacement for the vLLM reasoning parsers. No strong opinion either way :)
Now that you mention it, a lot of LLMs drop the `thinking` block entirely in their template, because they don't render thinking blocks from past turns. We could probably switch to `reasoning_content` without too much pain!
Also +1 for `reasoning_content`. But there is also a chance here to standardize, maybe along the lines of https://standardcompletions.org/
This is very useful for huggingface/trl#4115.
Made a quick demo so reviewers can try this out. Just switch to the PR branch and run:

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's 6 * 8?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1])
```

If it works, you should see the output correctly split into `thinking` and `content` keys.

And here's a quick demo for tool calling:

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
        location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, tools=[get_current_weather], max_new_tokens=512)
print(out[0]["generated_text"][-1])
```

The tool call should be correctly parsed as a key in the response dict.
Hey, I've recently been working on the tool call handling in vLLM and came across this. Lots of really cool stuff going on here! A couple questions I have:
To this point, the Harmony library expects to deal directly with token ids, right? So if we allow integration with model provider libraries for this parsing, we may have to support token ids.
Hi there, I help with structured outputs and tool calling on vLLM. I left some comments here; let me know if there is anything we can help with.
There is also a WIP tool call parser on our end that largely depends on the XGrammar structural tag, so I just want to make sure these wouldn't duplicate some of the work here.
| "properties": { | ||
| "role": {"const": "assistant"}, | ||
| "content": {"type": "string"}, | ||
| "thinking": {"type": "string"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also +1 for reasoning_content. But there is also a chance here to standardize
Maybe along the line of https://standardcompletions.org/
> Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to save and share the schema.
My 2 cents: I do think it might be beneficial to keep the parser implementation of `tokenizer.parse_response` in huggingface/tokenizers (i.e. the Rust implementation).
For the openai/harmony format, they also seem very performant.
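A minimal sketch of the workflow the quoted docs describe; the checkpoint name and the schema contents below are placeholders, and a real schema would also carry the PR's `x-` parsing keys:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any tokenizer works for illustrating the API shape.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Placeholder schema dict: a real one also describes how to extract each
# field from the raw generated text via the PR's "x-..." keys.
tokenizer.response_schema = {
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
    },
}

# Parse raw generated text into a structured assistant message dict
# (with a real schema, not the placeholder above).
message = tokenizer.parse_response("model output text goes here")

# The schema is saved alongside the tokenizer, like a chat template.
tokenizer.save_pretrained("my-model-with-response-schema")
```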
> ## Developers: Complex schemas
>
> Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
You might also be interested.
> ## Developers: Understanding a simple response schema
>
> Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
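For reference, the simple schema the quoted docs build on is just the fragment shown earlier in this thread; as a plain Python dict (without the PR's `x-` parsing keys) it would look roughly like:

```python
# The assistant-message structure from the quoted schema fragment.
# Real response schemas also carry "x-..." keys telling the parser how to
# extract each field from the raw model text; those are omitted here.
simple_schema = {
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"},
    },
}
```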
How should we handle `tool_calls` that are in XML format? For example, Qwen3-Coder.
In general, our inputs/outputs are in JSON schema format, even when models render them in a different format. We expect the input to a chat template to be JSON schema (or equivalent Python), and the decoded output from chat parsing should be as well. This was to enable a consistent API across models.
This is true even when the model does something totally different, like rendering tool calls in XML! In that case, the chat template and parser should translate the standard API to XML and back.
It's likely (assuming we go with this feature and don't replace it with something more like Structural Tag) that we'd add an xml parser to the spec as well, like the json parser that already exists.
Thanks for the reference @aarnphm! This is a really cool and useful PR! We’ve also had discussions with the vLLM team about how to build a unified tool calling parser interface, and I think this will be very helpful as well.
In XGrammar, we previously implemented a Structural Tag whose goal is to describe various kinds of structures. With guided decoding, it ensures the output strictly follows the defined structure, while also leaving room for potential support of parsing output in the future. This will be merged into the vLLM/SGLang/TensorRT main branches within the next two weeks. Its docs can be found at https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html. It also supports constraining output in XML format.
I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.
I would also like to see a push to upstream this to xgrammar as well.
@Ubospica yes, Structural Tag looks very similar to this! Constrained generation and output parsing obviously have a lot of overlap, since they both define an output schema of some kind. Do you know how output parsing was intended to be implemented for Structural Tag?
Actually, a better question: if we want to align with XGrammar, should we try to extend the Structural Tag spec to allow output parsing, or should we have an output schema for parsing that's separate from the Structural Tag?
I thought about it for a while and here's what I have: the main thing response parsing needs that structural tag doesn't have is a way to map segments of the output to paths in the parsed message dict. For example, it is very common that LLMs render tool calls or tool definitions as JSON schema, but not in the standard OpenAI format our API expects for tool calls: they may rename "arguments" to "parameters", or they may leave out the "function" key, etc. The goal of chat parsing is that the model should return a message dictionary that is ready to be appended to the chat without any further parsing from the user, which means it must be in the standard API format. This means we need a way to map "segments" of the structural tag schema to segments of the output dict.

It's easy to express the model's raw output as a structural tag schema, but it's trickier to do the mapping from there to the output format we want. It may even be strictly impossible in some cases with some of the structural tag keys.

Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?
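The concrete before/after examples from this comment aren't reproduced above, so here is a hypothetical stand-in based on the renamings mentioned ("arguments" → "parameters", no "function" wrapper):

```python
# Hypothetical raw tool call as a model might emit it: non-standard key names.
model_emitted = {
    "name": "get_current_weather",
    "parameters": {"location": "Paris"},
}

# The standard-format message we want parsing to produce, ready to append
# to the chat history without further user intervention.
desired_parse = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris"},
            },
        }
    ],
}
```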
For information, multiple tool calls aren't supported yet:

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
        location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris and London?"
messages = [
    {"role": "user", "content": prompt}
]

tools = [get_current_weather]
out = pipe(messages, tools=tools, max_new_tokens=512)
print(out[0]["generated_text"][-1])
```
Thanks for the great discussion here! Previously xgrammar had a proposal for parsing output text with the structural tag at mlc-ai/xgrammar#303. It's not finished yet, but it should be easy to pick it up. This maps the output text to the structure of the StructuralTag. To further use it in tool calling (and also parallel tool calling), we still need to map it into the OpenAI format.

I agree with you that with regex/EBNF the output is harder to parse. It's possible with xgrammar's builtin Earley parser, but it may introduce multiple possible parsing results, and we'd need to determine which one to use. Maybe we can restrict the parser to StructuralTags without any regex/EBNF content.

One possible and suitable solution is to have an extra layer of abstraction for tool calling and lower that into the structural tag, use the structural tag parser to handle it, and then convert it to the OpenAI tool calling format. The benefits could be: 1) it's easier to handle the different tool calling formats of different models; 2) we can unify constrained decoding (guided decoding) and tool call parsing.
Quick update here: I experimented with dropping this PR and instead adding parsing support to constrained generation schemas. I think constrained generation schemas are still very useful! But I'm much less confident that it's a good idea to overload them with both tasks; there's actually a lot of friction between them. It probably makes more sense for models to have both a "response schema" and a "generation schema", and this PR will remain focused on the first.
Going to merge this one after addressing most of the reviewer comments! I think the docs could still use a little more cleanup, but the core is solid, and I'd like this to get into a release sooner rather than later so I can start merging PRs to add response schemas to models. Since only a few models will have response schemas to start and I'll likely be writing all of them, we can treat the feature as experimental/unstable for now, and if I need to change anything in the spec I can go back and change the schemas too.
* Initial commit
* Adding more tests, bugfixes, starting tool tests
* Add support for JSON parsers and some tool tests
* stash commit
* stash commit
* stash commit
* stash commit
* stash commit
* Fix cohere schema, fix a lot of the recursive parser code
* GPT-OSS passing too!
* Update tests
* make fixup
* Offset tracking partially done
* stash commit
* stash commit
* Assistant masking Just Works
* make fixup
* stash commit
* stash commit
* JMESPath approach
* stash commit before i rip this PR apart
* Remove broken offset code
* Remove broken offset code
* Update chat parsing code and add tests for Ernie + fix Cohere tests for new format
* Implement tokenizer method
* jmespath dependency handling
* Completed TODOs
* Add support to TextGenerationPipeline
* Update GPT-OSS schema and test cases
* make fixup
* Fix typing (??)
* missing future import
* Use old typing in tokenization_utils_base.py
* put jmespath in various extras
* Remove accidental newline
* Guard tests correctly
* Remove require_jinja on the schema tests since we don't actually apply chat templates there
* make fixup
* fix some bad linter changes
* Fix docstring
* Push draft documentation
* Extend tests, more documentation
* make fixup
* docs docs docs
* Add Processor support
* Add to toctree
* Flag markdown correctly
* Remove double backslashes in docs for simplicity
* Simplify node-regex-to-dict
* Add support to ImageTextToTextPipeline
* Add support to ImageTextToTextPipeline and save/loading support in Processors
* Begin reworking docs to start fitting in response parsing
* Fix rebase
* Expand documentation further
* Expand documentation further
* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section
* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section
* More docs update
* Update TextGenerationPipeline to support tools properly
* Some rebase fixes
* Re-add is_jmespath_available
* Re-add is_jmespath_available
* Add Qwen3 parser and test, add maybe-json support
* Rollback processor changes - we'll wait for legacy saving to be deprecated
* Make fixup
* Revert ImageTextToText changes for now
* Add pipeline test
* make fixup
* Resolve a todo
* Resolve more TODOs and clean up the spec a little
* Add ref in the tools doc
* Update docs/source/en/chat_response_parsing.md (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update src/transformers/utils/chat_parsing_utils.py (Co-authored-by: Joao Gante <[email protected]>)
* Add a docstring for parse_response
* Add function docstring and reference it in the docs
* Fix generate link
* Revert Processor changes for now
* Use updated GPT-OSS format
* Print the dict keys instead of the whole dict so the example doesn't become too big

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
This PR is a replacement for #39609. The idea is that models can include a message schema, allowing model output to be parsed into a structured form. The original plan was to allow parsing of the entire chat history, essentially the inverse operation of `apply_chat_template`, but the schemas involved were too complex and there was no realistic hope that users would be able to write them!

This PR simplifies things - we focus only on parsing the output generated by the model. This is mainly relevant for tool calling and chain of thought models, both of which emit structured output that often needs manual handling before it can be appended to the chat.

The output schema is stored as a key on the tokenizer. It consists of a JSON schema, representing the structure of messages emitted by the model, with additional `x-` keys that indicate how parsing should be performed. Parsing is mostly done through regexes, but there is also support for common tool call formats like `json` to be directly parsed without you having to write an entire JSON regex parser 😅

Work to do:

- Support parsing in `Processor` classes too (Will move to separate PR)
- Support parsing in `ImageTextToText` pipeline (Will move to separate PR)
- Deepseek (tool calling isn't working in their template, will fix after `xml` support)

Documentation to do:

- `parse_response` explanation and show how it works with `TextGenerationPipeline`
- `x-` fields

Open questions:

- Merge `chat_parsing_utils.py` into `chat_template_utils.py`?